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ABSTRACT 


Objective: The main objective of this project is to explore and analyse a secondary dataset which collected from “Hospital Uni- 


versitario de Caracas” in Caracas, Venezuela. 


Methods: The dataset comprises 858 patients’ information relating to demographic information and medical history data. There 
is a large number of records which are left with blank, which might be intentionally avoided by the patient due to privacy con- 
siderations. SAS Studio is utilized in data exploration and data pre-processing. Data cleaning and data transformation are con- 
ducted basing on the knowledge gathered in the process of data exploration. Afterwards, the dataset was exported from SAS 
Studio and uploaded to Hadoop Hortonworks platform for analysing purpose. Lastly, five hypotheses have been explored with 


the visualization tool of Tableau. 
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INTRODUCTION 


According to the stats from the World Health Organization, 
cervical cancer is recognized as the fourth most common 
cancer-causing woman mortality!. There are many effective 
methods to prevent and detect cervical cancer in early stage, 
such as cytology-based cervical cancer screening, visual in- 
spection with Acetic Acid Application (VIA) as an alterna- 
tive in developing countries, HPV test in cervical screening”. 
However, due to limited infrastructure, lack of skilled medi- 
cal professionals in place and low public awareness of the 
risk of this disease, it is still causing high mortality rate in 
developing countries such as the countries in Africa, Latin 
American countries and parts of Asia. 


There are many factors which will lead to high chance of cer- 
vical cancer infection, including but not limited to: (1) Human 
Papilloma Virus (HPV), 99.7% of the cervical cancer infec- 
tion is linked with HPV infection (Chichareon, 1998). (2) sex- 
ual partner numbers (3) individual immunity (4) count of birth 
deliveries (5) tobacco and (6) birth controlling device usage’. 


In this report, will use one of the secondary dataset collected 
from a Latin American country, “Hospital Universitario de 
Caracas” in Caracas, Venezuela. There are 858 records in the 
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dataset, which comprises some demographic information, and 
historical medical records. There is a large number of records 
which are left with blank, which might be intentionally avoid- 
ed by the patient due to privacy considerations. First, will 
study about the privacy and security issue in the healthcare 
industry and some techniques to preserve data security in Sec- 
tion 2. Then, data is imported into SAS for data exploration 
and pre-processing, this will be talked about in Section 3, 4 
and 5. Next, to leverage big data computational power, the 
dataset is pumped into Hadoop and stored as Optimized Row 
Columnar (ORC) file, this is shown in Section 6. In Section 7, 
the dataset from Hadoop will be linked with the visualization 
tool, Tableau to analysis five hypotheses. Finally, in Section 8, 
there is a discussion and conclusion part. 


RELATED WORKS 


Privacy and security concern in healthcare 

The digitalization of healthcare data is becoming a preva- 
lent approach for major hospitals and clinics. There are evi- 
dent benefits such as reduction of cost, and rapid delivery 
of healthcare services (physicians are not bond to location 
any more), etc. In the meantime, it has also arisen public 
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concern in regards to privacy and security issue. Accord- 
ing to a survey conducted in the US (2013), 69% of the pa- 
tients hold privacy concern about their medical record since 
healthcare providers deployed digital record’. In response to 
this, they will either give faulty information or avoid giving 
some of the information. This will jeopardize the veracity of 
the source of data and further impact prediction modelling 
accuracy. 


The issue of privacy and security in healthcare has been of 
concern dated back since the 20" century. In 1996, the US 
has issued the Health Insurance Portability and Accountabil- 
ity Act (HIPAA), to safeguard the medical record privacy 
with the power of federal law’. It requires health entities to 
set up proper administrative and technical protections to en- 
sure data confidentiality and security. Besides, it also permits 
the individual to require full disclosure of its personal health- 
care information from healthcare entities. Besides this, there 
are many other acts published in other regions and countries, 
for example, the Data Protection Directive in the EU. De- 
spite all these, even though sometimes data are removed 
with identifiers (de-identified), it is still possible to locate 
individual unique information with some external dataset to 
assist’. In case of sensitive information like healthcare data 
is leaked out, it will cause a huge loss to enterprises as well 
as individuals. 


On the other hand, with the digitalization of medical infor- 
mation, medical data can be transferred and shared with 
other professionals easily, this has facilitated the improve- 
ment of the medical industry. This has brought benefit to the 
patient for better diagnose and stimulated the breakthroughs 
in the medical field. 


In consideration of this, this is a dilemma to find a balance 
between the protection of privacy and digitalization in the 
healthcare industry. 


Recent Development in Healthcare 


(1) Cloud Computing 


Due to the nature of healthcare-related data, it 1s the cross- 
boundary area between computer technology, biology and 
biomedical science. Hence, datasets of the healthcare field 
are usually characterized as vast volume, various type (in- 
cluding structured and unstructured). To better assist with 
this, big data and cloud computing is advocated as a solu- 
tion®. Cloud computing is enjoying its inherited advantages 
such as the extendable capacity of storage and fast process- 
ing, reduced cost for maintenance, convenience for data 
sharing with different research centres, etc. However, the 
adoption of cloud is always accompanied by the concern in 
privacy and security. 








(2) Mobile Device Applications 


The use of smartphone and other mobile devices are getting 
ubiquitous. This allows the hospital to provide continuous 
monitoring over to the patients and can diagnose and deliv- 
ery interventions remotely. At the same time, it enables doc- 
tors to share condition and graphical information about the 
patient with other specialists from whom they can communi- 
cate, and exchange ideas then can get informed with the op- 
timal treatment choice across the world. This is an effective 
method for driving improvement in the medical field. The 
research did in 2013 shows that more than half of the health- 
care professionals are working either from a smartphone or 
tablet. However, despite all the merits it conveys, 69% of 
the patients are showing privacy and security concern about 
utilizing mobile devices for healthcare information delivery. 


Approaches for privacy-preserving 

The rapid development and application of new technologies 
in healthcare surely improved the healthcare services perfor- 
mance, but potentially it also increased the threat of privacy 
and security breaches. To prevent improper usage of this sen- 
sitive data, the healthcare entity could deploy some physical 
and electronic methods to enhance the security level. 


(1) Encryption 


Encryption by definition is to transform the plain understand- 
able text into ciphertext, which cannot be understood by peo- 
ple without authorization. Encryption is based on symmetric 
and asymmetric algorithms. For the symmetric approach, a 
secret key is utilized to generate cipher text and vice versa, 
when decrypting cipher text back to plain text. While for the 
asymmetric approach, a pair of security key is generated. 
The decryption key is kept in secret and the public key will 
be sent to whoever needs the authorization to access data. 


(2) Authorization control 


Authorization control is referring to controlling users’ access 
to the availability of the data. Under the different authoriza- 
tion, the user will be granted different access authorize and 
they will be under the governess when retrieve or modify in- 
formation in the database. By this way, this technique could 
protect the integrity and confidentiality of data source. There 
are many technologies currently been pervasively used: 





a) Biological method: by scanning a fingerprint, iris, fa- 
cial expression, or recognition of voice, signature etc. 

b) Mandatory Access Control (MAC): This controls the 
information disclosure extent user can access by des- 
ignating security roles. It will classify the sensitivity 
level of the information and classify authorization of 
the subject, then assign authorization to the subject. 
This is a highly restricted and confidential way, which 
is being used for maintaining military information as 
well. 
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c) Password: This is a simple and most widely used ap- 
proach. 


(3) Data Masking 


Data masking is also called de-identifying, which means to 
remove the individual unique identifiers such as name, or se- 
curity number from the dataset. A most common technique 1s 
K-anonymity. Which means when releasing a record, it must 
have at least (k-1) other records whose values are indistinct 
even with external resource. By doing this, k-anonymity 
makes it difficult to recognize the identity of the individuals 
in the dataset. However, a drawback of this technique is it 
will face the problem when anonymizing high dimensional 
dataset?. 


(4) Laws and Regulations 


Healthcare providers are under compulsory to conform to 
rules and regulations. With the revolution of transiting paper 
medical record to paperless electronic medical record, this 
regulatory compliance has become a major focus from the 
public’. There are many laws and regulations in place, such 
as the Health Insurance Portability and Accountability Act 
(HIPAA) of the US and the constitution published by Brazil 
in 1988. Failure to conform to these regulations will lead to 
civil and criminal penalties. 





Initial Data Exploration 


Type of Each Attribute 
According to the metadata, there are 36 attributes in the data- 
set, types of each attribute are identified as follows (Table 1): 


Table 1: Types of each attribute 


No Variables Attributes Types 
1 Age Ratio 

ž Number of sexual partners Ratio 

3 First sexual intercourse Ratio 

4 Number of pregnancies Ratio 

5 Smokes Nominal 
6 Smokes (years) Ratio 

7 Smokes (packs/year) Ratio 

8 Hormonal Contraceptives Nominal 
9 Hormonal Ratio 

Contraceptives(years) 

10 IUD Nominal 
11 IUD (years) Ratio 

12 STDs Nominal 
13 STDs (number) Ratio 

14 STDs: condylomatosis Nominal 
15 STDs: cervical condylomatosis Nominal 


16 STDs: vaginal condylomatosis Nominal 
17 STDs: vulvo-perineal condylo- Nominal 
matosis 
18 STDs: syphilis Nominal 
19 STDs: pelvic inflammatory Nominal 
diseases 
20 STDs: genetical herpes Nominal 
a STDs: molluscum contagiosum Nominal 
22. STDs: AIDS Nominal 
23 STDs: HIV Nominal 
24 STDs: Hepatises B Nominal 
25 STDs: HPV Nominal 
26 STDs: Number of diagnoses Ratio 
27 STDs: Time since the first Ratio 
diagnosis 
28 STDs: Time since the last Ratio 
diagnosis 
29 Dx: Cancer Nominal 
30 Dx: CIN Nominal 
31 Dx: HPV Nominal 
32 DX Nominal 
23 Hinselmann Nominal 
34 Schiller Nominal 
35 Citology Nominal 
36 Biopsy Nominal 


Dataset Understanding 


Importing Dataset 

Upload the dataset onto SAS server, then import data into 
Library with codes as below. The imported dataset 1s named 
as “cervical cencer”, Fig 1. Output data, as well as table of 
contents, are shown in Fig 2 and Fig 3. The variables are 
shown in Fig 4. 


PROC MEANS 

PROC MEANS procedure is a descriptive statistic analyz- 
ing tool for understanding data summarization. By default, 
the output report shows the number of observations, mean, 
standard deviation, min and max of the group of values. As 
in Fig.5, it can be further extended to get the output of other 
statistical analysis values’. 





From below Fig.7, we can tell the sample size, central ten- 
dency (mean, median), range (min, max), variety (standard 
deviation), dispersion (1* quartile and 3™ quartile), and also 
the number of missing value. This gives a rough view of the 
data distribution and representative value, however, for nom- 
inal attributes it does not make too much sense to calculate 
the mean, median, etc. Will use PROC FREQ and PROC 
UNIVARIATE to further understand the frequency and dis- 
persion for categorical variables later. Another finding 1s, 
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there is a large amount of incomplete value for information 
related to the historical medical record, which will impact 
model performance. Hence, missing values are expected to 
be handled before analyzing phase. 





PROC FREQ 

PROC FREQ procedure provides an analysis function to un- 
derstand the frequency, percentage, cumulative frequency, as 
well as cumulative percentage of each variable. It can also 
be utilized to stratify data and plot cross-table between two 
variables. Below is the code for getting PROC FREQ output. 


PROC UNIVARIATE 

PROC UNIVARIATE procedure provides descriptive sta- 
tistic measures for understanding numeric data. Comparing 
with PROC MEANS, UNIVARIATE can also indicate skew- 
ness and kurtosis, frequency table, and also extreme values. 
It can also be used together with the VAR statement and plot 
statistic summary. Below is the code for getting UNIVARI- 
ATE output for all variables. 


Summarization of Each Variable 
1). Age 


As can see from Fig.6, total observation number of Age is 
858, there is no missing value in Age. Mean, Median and 
mode are provided as 26.82, 25, and 23 respectively. Slightly 
positive skewness and positive kurtosis tendency are detect- 
ed. Standard deviation (8.50) is less than the mean (26.82). 
Interquartile range (Q3 — Q1) is 12, hence the upper bound- 
ary for a suspected outlier is 50 (Q3 + 1.5*IQR), and lower 
boundary for suspected outlier is 2 (Q1 — 1.5*IQR). 


Fig.7 depicted distribution histogram of Age. The most 
frequent value appears around 22.5, and the distribution 
is showing slightly right-skewed and many data dispersed 
along X-axis. 


Fig.8 is the boxplot of Age. The box and whisker plotted out 
the range (bottom and top whisker), Interquartile range (be- 
tween 2" and 4" whisker), as well as median (3™ whisker) 
and mean (diamond). Besides, outliers are also plotted as 
dots above the upper whisker. 


The findings of visualization graphs match up with what can 
tell from the descriptive output. 





B setup x Ae 


xo- BRBEBAOG*FKR Ox H wR 


/* Generated Code (IMPORT) */ 
factors_ce 


- /* Source File: risk_ ors_cervical_cancer - Original.csv */ 
i E /* Source Path: /home/tp04s6e20 */ 
PF EN reste /* Code generated on: 8/6/18, 11:11 AM */ 
Pma Xweb_drop_table(WORK.cervical_cancer); 
F) FILENAME REFFILE '/home/tp8486020/risk_factors_cervical_cancer - Original.csv'; 
dd PROC IMPORT DATAFILE=REFFILE 
>a 
ba OUT=WORK.cervical_can 
p wE GETNAMES=YES; 
— RUN; 
“w 
> RB CERVICA N PROC CONTENTS DATAsWORK.cervical cancer; RUN;| 
Xweb_open_table(WORK.cervical_ cancer); 


Figure 1: Import dataset into SAS studio. 
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Data Set Name 
Member Type 
Engine 
Created 

Last Modified 
Protection 
Data Set Type 
Label 


Data Representation 


Encoding 


The CONTENTS Procedure 


WORK.CERVICAL_CANCER 
DATA 

v9 

08/06/2018 11:13:20 
08/06/2018 11:13:20 


utf-8 Unicode (UTF-8) 


Figure 2: Output of Content Procedure. 


Data Set Page Size 


Engine/Host Dependent Information 


131072 


Number of Data Set Pages 2 


First Data Page 
Max Obs per Page 
Obs in First Data Page 


1 
511 
484 


Number of Data Set Repairs 0 
/saswork/SAS_workEAB600001E50_odaws01-prod-sg/SAS_work1F8200001E50_odaws01-prod-sg/cervical_cancer.sas7bdat 


Filename 

Release Created 
Host Created 

Inode Number 
Access Permission 
Owner Name 

File Size 

File Size (bytes) 


Figure 3: 
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Figure 4: 


9.0401M5 
Linux 
536884559 
IW-f--f-- 
tp0486020 
384KB 
393216 


SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64 


Observations 
Variables 

Indexes 

Observation Length 
Deleted Observations 
Compressed 

Sorted 


Output of Engine/Host Dependent Information. 


Alphabetic List of Variables and Afiributes 


Variable 

Age 

Biocrsy 

Citohogpy 

Dex 

Dee: (CIN 

Dre: (Carer 

Dee: HPA 

First seaual intercourse 
Hinseinann 

Hormonal Contraceptives 
Hormonal Contraceptives (years) 
IUD 

IUD {years 

Num of pregnandes 

Number of seus! partners 
STs 

STDs (numer 

STDs: Number of diagnosis 
SDs: Time sinoe first diagnosis 
STDs: Tine sino last diagnoses 
SOs 0s 

STs HIY 

STDsHPW 

STOs:-Hepattis E 

STDscevica] condbvomesiosis 
STOsscondylorres ines 

ST Osxgesnitzl herpes 

SOs snaiuscum comiagiasum 
STOspevic rfiemmratory disease 
STO0s=yphile 

STOswaginal condyomatoss 
STOsswuivo-oeinesal combdomtas 
Schulker 

Smokes 

Smokes (paksie | 


Smokes (years) 


TYP? 
Num 
Nurm 
Nurn 
Nurm 
Nunn 
Num 
Murn 
Char 
Murn 
um 
Mur 
Char 
Char 
Char 
Nur 
Murn 
Nurn 
Nur 
Char 
Char 
Nur 
Nur 
Nurn 
Nurm 
Murn 
Mur 
Nur 
Nur 
Hur 
um 
Nurm 
Murn 
Murn 
Murn 
Nur 


Hurm 


Output of Attribute List. 


Len 
5 
a 


a 


Format 
BEST12 
BEST 12 
BEST12 
BEST12 
BEST 12 
BEST 12 
BEST12 
zd. 

BEST12 
BEST12 
BEST12 


3 wi vI 
LS 


e 

BEST12 
BEST12 
BEST12 
BEST12 


it vI 


BEST12 
BEST 12 
BESTi2 
BEST 12 
BEST12 
BEST12 
BEST 12 
BEST 12 
BEST12 
BEST12 
BEST12 
BEST12 
BEST12 
BEST12 
BEST12 
BEST12 


Informat 
BEST32 
BESTS 
BEST X2 
BESTI2 
BEST 
BESTS 
BESTS2 
za. 
BEST32 
BESTS2 
BEST X2 


pI it pi 
t 


BESTI2 
BESTI 
BESTI? 
BEST32 


(PA (EA 


BEST 
BEST2 
BESTA? 
BEST32 
BEST32 
BEST32 
BESTA? 
BEST 
BEST32 
BESTA? 
BEST32 
BEST32 
BEST 
BEST% 
BEST32 
BEST2 


858 
36 


256 
0 

NO 
NO 
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Ine MEAN rrocegure 


Lower 95% Upper 95% 


Variable N Mean Median Std Dev Minimum Maximum Range CLforMean CLforMean Lower Quartile Upper Quartile N Miss 
Age 858 26.8205128 25.0000000 + 8.4979481 13.0000000 84.0000000 71.0000000 | 26.2510934 27.3899323 20.0000000 32.0000000 0 
Number of sexual partners 832 25276442 2.0000000 16677605 1.0000000 28.0000000 27.0000000 24141554 2.6411331 2.0000000 3.0000000 26 
Smokes 845  0.1455621 0 0.3528756 O 10000000 10000000 0.1217354 0.1693889 0 0 13 
Smokes (years) 845 = 12197214 0 40890169 0 37.0000000 37.0000000 0.9436243 1.4958185 0 0 13 
Smokes (packs/year) 845 0.4531440 0 2.2266098 0 37.0000000 37.0000000 0.3027996 0.6034883 0 0 13 
Hormonal Contraceptives 750 0.6413333 1.0000000 0.4799292 0 1.0000000 1.0000000 0.6069303 0.6757364 0 1.0000000 108 
Hormonal Contraceptives (years) 750 2.2564192 | 0.5000000 3.7642535 0 30.0000000 : 30.0000000 1.9865840 2.5262545 0 3.0000000 108 
STDs 753  0.1049137 0 0.3066458 0 1.0000000 1.0000000 0.0829762 0.1268512 0 0 105 
STDs (number) 753 0.1766268 0 0.5619928 0 4.0000000 4.0000000 0.1364217 0.2168319 0 0 105 
STDs:condylomatosis 753 0.0584329 0 0.2347162 O 1.0000000  1.0000000 0.0416413 0.0752246 0 0 105 
STDs:cervical condylomatosis 753 0 0 0 0 0 0 0 0 105 
STDs:vaginal condylomatosis 753 0.0053121 0 0.0727385 0 1.0000000 1.0000000 0.000108356 0.0105158 0 0 105 
STDs:vulvo-perineal condylomatos 753 0.0571049 0 0.2321972 O 10000000  1.0000000 0.0404935 0.0737163 0 0 105 
STDs:syphilis 753  0.0239044 0 0.1528528 0 1.0000000  1.0000000 0.0129693 0.0348395 0 0 105 
STDs:pelvic inflammatory disease 753 0.0013280 0 0.0364420 0 10000000 10000000 -0.0012790 0.0039351 0 0 105 
STDs. genital herpes 753 0.0013280 0 0.0364420 0 1.0000000 1.0000000 -0.0012790 0.0039351 0 0 105 
STDs:molluscum contagiosum 753  0.0013280 0 0.0364420 0 10000000 10000000 -0.0012790 0.0039351 0 0 105 
STDs:AIDS 753 0 0 0 0 0 0 0 0 105 
STDs:HIV 753 0.0239044 0 01528528 0 1.0000000 1.0000000 0.0129693 0.0348395 0 0 105 
STDs:Hepatitis B 753 0.0013280 0 0.0364420 O 10000000 10000000 -0.0012790 0.0039351 0 0 105 
STDs:HPV 753 0.0026560 0 0.0515025 0 1.0000000 1.0000000 -0.0010285 0.0063405 0 0 105 
STDs: Number of diagnosis 858 0.0874126 0 03025447 0 3.0000000  3.0000000 0.0671401 0.1076851 0 0 0 
Dx:Cancer 858  0.0209790 0 0.1433976 O 10000000 10000000 0.0113704 0.0305876 0 0 0 
Dx:CIN 858 0.0104895 0 0.1019392 0 1.0000000  1.0000000 0.0036589 0.0173201 0 0 0 
Dx:HPV 858  0.0209790 0 0.1433976 0 10000000 1.0000000 0.0113704 0,0305876 0 0 0 
Dx 858  0.0279720 0 0.1649888 0 41.0000000  1.0000000 0.0169167 0.0390274 0 0 0 
Hinselmann 858 0.0407925 0 0.1979246 0 10000000  1.0000000 0.0275303 0.0540548 0 0 0 
Schiller 858 0.0862471 0 0.2808923 0 1.0000000 1.0000000 0.0674254 0.1050688 0 0 0 
Citology 858  0.0512821 0 0.2207011 O 41.0000000  1.0000000 0.0364936 0.0660705 0 0 0 
Biopsy 858  0.0641026 0 02450784 0 10000000 1.0000000 0.0476807 0.0805245 0 0 0 


Figure 5: Output of PROC MEANS Procedure. 








The UNIVARIATE Procedure 
Variable: Age 
Quantiles (Definition 5) 
caret Level Quantil 
N 858 | Sum Weights 858 =. — 
Mean 26.8205128 | Sum Observations 23012 + 100% Max | 84 
Std Deviation 849794807 Variance 72.2151213 99% 50 
Skewness 1.39427877 Kurtosis 4.77857515 95% 41 
Uncorrected SS 679082 Corrected SS 61888.359 90% | 37 
Coeff Variation 31.68451 Std Error Mean 0.29011516 75% Q3 32 
50% Median 25 
Basic Statistical Measures 
ae 25% Q1 20 
Location Variability 
10% 18 
Mean 26.82051 Std Deviation 8.49795 
Median 25.00000 Variance 72.21512 5% 16 
Mode 23.00000 Range 71.00000 1% 15 
interquartile Range 12.00000 0% Min 13 
Figure 6: Descriptive Statistic Values. 
Distribution of Age 
= PROC UNIVARIATE DATA = cervical cancer; 
VAR AGE; 
HISTOGRAM AGE; 
2 | RUN; 
15 
5 
10 
| Ea anai 
1 1 22 3 3 42 475 6 6 e 
Age 


Figure 7: Histogram of Age. 


es $ ods graphics / reset width=6.4in height=4.8in imagemap; 
< proc sgplot data= work.cervical_cancer; 
vbox Age; 
z yaxis grid; 
run; 
ods graphics / reset; 
@ o 
m 8 
> 


40 


Figure 8: Boxplot of Age. 


2) Number of sexual partners 


By applying PROC UNIVARIATE and PROC MEANS, be- 
low outputs of Fig.9 and Fig.10 are acquired. From Fig.9, 
descriptive statistics values are computed. The mean, median 
and mode are roughly same, 2.53, 2.0, and 2.0 respectively. 
The distribution is showing positive tendency (skewness = 


5.46), however, the Kurtosis has far exceeded normal value, 
which indicates the distribution curve is intensively central- 
ized around the mean. Due to this, standard deviation and 
variance is relatively small (std = 1.67, variance = 2.78). 
The IQR is 1.0, with Ist quartile equals to 2 and 3rd quar- 
tile equals to 3, making high and low limitation for outlier 
identifier being 4.5 and 0.5. Hence, around 5 percentage of 
the value can be regarded as an outlier. Besides, 26 missing 
values detected. 





The UNIVARIATE Procedure 
Variable: Number_of_sexual_partners 


Moments 
N 832 Sum Weights 832 
Mean 2.52764423 Sum Observations 2103 
Std Deviation 1.66776048 Variance 2.78142501 
Skewness 5 45464861 Kurtosis 69 2047544 
Uncorrected SS 7627 Corrected SS 2311.36418 
Coeff Variation 659808234 Std Error Mean 0.05781919 
Basic Statistical Measures 
Location Variability 
Mean 2.527644 Std Deviation 1.66776 
Median 2.000000 Variance 2.78143 
Mode 2.000000 Range 27.00000 
Interquartile Range 1.00000 


The MEANS Procedure 


Analysis Variable : Number_of_sexual_partners 
N Miss 
26 


Quantiles (Definition 5) 
Level Quantile 
100% Max 28 
99% 

95% 

90% 

75% Q3 
50% Median 
25% Q1 
10% 

5% 

1% 

0% Min 


= a d ål NH WwW & OF WS 


Figure 9: Descriptive Statistics of Number of sexual part- 
ners. 


Distribution of Number_of_sexual_partners 


478 


40 \ 





Percent 
3 





| 11 2 1 1 
o- = 
2 4 6 8 10 12 14 #146 18 20 22 2 2 2 
Number of sexual partners 


PROC UNIVARIATE DATA = cervical_cancer; 
VAR Number_of_sexual_partners; 


HISTOGRAM Number_of_sexual_partners/ NORMAL BARLABEL=COUNT; 
RUN; 


Figure 10: Histogram for Number of sexual partners. 
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Fig.10 is the Histogram for Number of sexual partners. As 
can tell from the graph, the distribution curve is showing a 
thin tail, and most of the values are tightly distributed sur- 


frequency for Smokes 





rounding the mean. Some outliers can be identified on the 
right side of the X-axis. 
e 400 
The UNIVARIATE Procedure 
m Q-Q Plot for Number _of_sexual_partners zia 
o 
5 z% 0 
g (0 1 
2 ° Smokes 
x 10 © 
5 a”. Figure 13: Histogram of Smokes. 
z 0 0 o 
Distribution of Smokes __years_ 
100 
10 749 
4 3 -2 “1 0 1 2 3 4 
80 
Normal Quantiles 
Normal Line Mu=2 5276. Sigma=1 6678 
PROC UNIVARIATE DATA = cervical cancer; g T 
VAR Number of sexual partners; Š 
qqaplot Number_of_sexual partners / normal (mu=est sigma=est color=red l=2); = 
RUN;| 40 
. in | 
Figure 11: Q-Q Plot of Number of sexual partners. 20 
19 19 `o. 14 
F . 7 ° | 11 4 4 1 1 : 2 1 
Fig.11 is the Quantile-Quantile plot for Number of sexual ee aa a e a 


Smokes_ years_ 


partners. As can tell from the graph upper right part of the 
data value has deviated from the normal reference line. This 
matches the deduction from descriptive statistics that there Figure 14: Histogram of Smokes (years). 
are some outliers exists at the upper end of Number of sexual 





Normal(Mu=1 2197 Sigma=4.089) 


partners variation range. 4) Smokes (years) 
3) Smokes 
Variable: Smokes __years_ 
Moments 
Cumulative Cumulative N 845 | Sum Weights 845 
Smokes Frequency Percent Frequency Percent Mean 1.21972141 | Sum Observations | 1030.66459 
Std Deviation 408901694 Variance 16.7200595 
0 T22 65.44 122 65.44 
Skewness 44654839 | Kurtosis 23.7684182 
1 123 14.56 845 100.00 Uncorrected SS | 15368.8539 | Corrected SS 14111.7302 
Frequency Missing = 13 Coeff Variation | 335.241875 Std Error Mean 0.14066646 
Basic Statistical Measures 
PROC FREQ data=CERVICAL_CANCER; Location Variability 
table Smokes; Mean | 1.219721 | Std Deviation 4.08902 
run; Median | 0.000000 Variance 16.72006 
ae Mode | 0.000000 Range 37.00000 
Figure 12: Frequency Distribution of Smokes. EEEE Canes > 


Since Smokes is a binary variable indicating YES/NO, it 
makes no meaning to analyzing central tendency. Hence, 
PROC FREQ is applied to understand frequency distribu- Missing aes = 
tion. As seen in Fig.12, 85.44 percentage of the observations | eet | | 
are 0, and 14.56 percentage is 1. There are 13 missing values a kas siis 
in the dataset. The histogram in Fig13 & Fig.14 is visualizing 
the same trend. 


eS E a 
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Figure 15: Descriptive Statistics of Smokes (years). 
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Fig.15 shows the descriptive stats of Smoke (years). Ther 
are 13 missing values in this attribute. Amongst the 845 re- 
sponded records, more than three quarter answered 0 to this 
attribute. Hence there is a high tendency of skewness and 
kurtosis. The mean is 1.22 and median, mode are 0. IQR 1s 0, 
which indicates most of the value points are tightly distrib- 
uted around the median. Fig.19 is showing the distribution 
histogram, which indicates the same trend. 





5) Smokes (packs/year) 


From Fig 16 details of descriptive statistics data, the mean/ 
median/mode all-around 0, the standard deviation 1s 2.22 and 
range is 37. The skewness and kurtosis are all far exceeding 
normal level, which means the distribution curve is highly 
positively skewed and tightly centralized surrounding the 
mean. There are 13 incomplete values. 


Plotting the histogram as seen in Fig.17, right skewness and 
thin kurtosis are noticed from the distribution curve. Besides, 
some outliers also noticed. 


The UNIVARIATE Procedure 
Variable: Smokes__packs_year_ 


Moments 
N 845 Sum Weights 845 Quantiles (Definition 5) 
Mean 0.45314395 Sum Observations 382.906638 Level Quantile 
Std Deviation 22266098 Variance 495779121 100% Max | 37.000000 
Skewness 930880616 Kurtosis 114.839708 99% 9.000000 
Uncorrected SS 435788761 Corrected SS 4184.37578 95% 2.500000 
Coeff Variation 491.369199 Std Error Mean 0.07659771 90% 0.513202 
75% Q3 0.000000 
Basic Statistical Measures 50% Median 0.000000 
Location Variability 25% Q1 0.000000 
Mean 0.453144 Std Deviation 2.22661 10% 0.000000 
Median 0.000000 Variance 4.95779 5% 0.000000 
Mode 0.000000 Range 37.00000 1% 0.000000 
interquartile Range 0 0% Min 0.000000 
Missing Values 
Missing Percent Of 
Value Count AllObs Missing Obs 
13 1.52 100.00 


Figure 16: Descriptive Statistics for Smokes (packs/year). 
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6) STDs 
frequency for STDs 
80 
60 
a 40 
20 
0 
0 1 
STDs 
The FREQ Procedure 
Cumulative Cumulative 
STDs Frequency Percent Frequency Percent 
0 674 89.51 674 89.51 
1 79 10.49 753 100.00 


Frequency Missing = 105 


Figure 18: Frequency for STDs. 


The frequency graph and frequency distribution for STDs 
can be seen from Fig.18. As can tell from the histogram, 
more observations have replied 0 comparing with 1. To be 
concrete, 10.49 percentage of the values are with 1 for STDs, 
while 89.51 percentage with 0. There are 105 missing values 
detected in STDs. 


7) STDs (number) 


As can tell from Fig.19 and Fig.20, the mean, mode and me- 
dian all closely around 0. For this attribute, there are a lot 
of missing values involved. And most of the observations 
(roughly 90%) are giving 0 as feedback. The range is 4, and 
the maximum value appears in 4. 





The UNIVARIATE Procedure 
Variable: STDs__number_ 
Moments 
N 753 Sum Weights 753 
Mean 0.17662683 Sum Observations 133 


Std Deviation 
Skewness 
Uncorrected SS 


Coeff Variation 


0.56199284 
3.40284907 

261 
318.180906 


Variance 
Kurtosis 
Corrected SS 
Std Error Mean 


Basic Statistical Measures 
Location Variability 
Mean 0.176627 Std Deviation 
Median 0.000000 Variance 
Mode 0.000000 Range 


0.31583595 
11.5512402 
237 508632 
0.02048016 


0.56199 
0.31584 
4.00000 


Percent 
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Figure 17: Histogram for Smokes (packs/year). 
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Figure 19: Descriptive Statistics for STDs (number). 
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histogram for STDs__number_ run 
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Figure 20: Histogram for STDs (number). 


8) STDs: syphilis 


Fig.21 depicted the frequency distribution table as well as 
distribution histogram for STDs_ syphilis. 2.39 percentage of 
the observations responded with 1 and rest are noted with 0. 
There are 105 missing values in this attribute. 


frequency for STDs_syphilis 


Percent 


0 
STDs_syphilis 


The FREQ Procedure 


Cumulative Cumulative 
STDs_syphilis Frequency Percent Frequency Percent 


0 735 97 61 735 97.61 
1 18 2.39 753 100.00 
Frequency Missing = 105 


Figure 21: Frequency for STDs: syphilis. 


9) STDs: Number of diagnoses 


histogram for STDs__Number_of_diagnosis run 


En 


STDs__Number_of_diagnosis 


The FREQ Procedure 
Cumulative Cumulative 
$TDs_Number_of_diagnosis Frequency Percent Frequency Percent 
0 787 91.72 787 9172 
1 68 793 855 99 65 
2 2 0.23 857 99 88 
3 1 0.12 858 100.00 


Figure 22: Frequency for STDs: Number of diagnoses. 


As can tell from Fig.22, highest value for STDs: Number of 
diagnoses appears in 3 and the majority responded with 0. 
No missing value for this attribute. 


10) STDs: Time since the last diagnosis 


Fig.23 shows the distribution of the values for STDs: Time 
since the last diagnosis. As can tell from the histogram, 787 
of the observations did not provide respond for this attribute. 
Among the values provided, the highest value resides in 9 
(which appeared one time) and the lowest value is 1 (which 
appeared 28 times). Due to 91.72 per cent of the value is 
missing, for this attribute, it requires either data cleaning 
or this attribute shall be ignored, otherwise, the predictive 
model will be highly biased. 





Histogram for STDs__Time_since last_diagnosis 


The FREQ Procedure 


STOs__ Time since last_diagnosis Frequency F 
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Figure 23: Frequency for STDs: Time since last diagnosis. 


11) Dx: Cancer, CIN, HPV, Dx 


The FREQ Procedure The FREQ Procedure 


Cumulative Cumulative 


Cumulative | Cumulative Dx_CIN Frequency Percent Frequency 


Ox_Cancer Frequency Percent Frequency Percent 


0 840 97 90 R40 97 90 0 849 98 95 849 98 65 
1 18 210 858 100.00 1 a 16 858 
The FREQ Procedure The FREQ Procedure 
Cumulative Cumulative Cumulative Cumulative 
Dx_HPV Frequency Percent Frequency Pero Ox Frequency Percent Frequency Percent 
0 840 97.90 840 97 90 0 8H 


97 20 834 97 20 

1 18 210 858 100.00 4 24 280 858 100.00 
Figure 24: Frequency Distribution for Dx: Cancer, CIN, HPV, 
Dx. 


Dx: Cancer, CIN, HPV and Dx will be recognized as the tar- 
get variable, which indicates whether or not the patient is 
diagnosed with cancer. As can see from the distribution Fig 
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24, around 2 percentage of the collected sample data is di- 
agnosed with either kind of cancer. There are some missing 
values noticed as well. 


DISCUSSION 


There are 858 observations in the target dataset, of which 
2.1 percentage is infected with Cancer, 1.05 percentage with 


1. Data Cleaning 


CIN, 2.1 percentage with HPV and another 2.8 percentage 
with another kind of cancer. However, there are a large num- 
ber of missing values in the dataset, which will largely af- 
fect the modelling performance. The incomplete values are 
mainly seen in columns requiring more private information 
(No missing value detected in an attribute such as Age). To 
achieve better model accuracy, Data Pre-processing and 
transformation is necessary. 





Table 2: Variables identified with quality issue need to be pre-processed 





No Variables Attributes incomplete inconsistent outlier 
Types 

1 Age Ratio 

2 Number of sexual partners Ratio Y 

3 First sexual intercourse Ratio Y 

4 Number of pregnancies Ratio Y 

5 Smokes Nominal Y 

6 Smokes (years) Ratio Y Y 

7 Smokes (packs/year) Ratio Y Y 

8 Hormonal Contraceptives Nominal Y 

9 Hormonal Contraceptives (years) Ratio y Y 

10 IUD Nominal Y 

11 IUD (years) Ratio Y Y 

12 STDs Nominal Y 

3 STDs (number) Ratio Y 

14 STDs: condylomatosis Nominal Y 

15 STDs: cervical condylomatosis Nominal x 

16 STDs: vaginal condylomatosis Nominal Y 

17 STDs: vulvo-perineal condylomatosis Nominal Y 

18 STDs: syphilis Nominal Y 

19 STDs: pelvic inflammatory diseases Nominal 7. 

20 STDs: genetical herpes Nominal Y 

ži STDs: molluscum contagiosum Nominal Y 

22 STDs: AIDS Nominal Y 

23 STDs: HIV Nominal Y 

24 STDs: Hepatises B Nominal Y 

25 STDs: HPV Nominal Y 

26 STDs: Number of diagnosis Ratio 

27 STDs: Time since the first diagnosis Ratio x x 

28 STDs: Time since the last diagnosis Ratio T ti 

29 Dx: Cancer Nominal 

30 Dx: CIN Nominal 

31 Dx: HPV Nominal 

32 DX Nominal 

33 Hinselmann Nominal 

34 Schiller Nominal 

35 Citology Nominal 

36 Biopsy Nominal 
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There are mainly three types of data quality issues which 
requires for cleansing, including missing data (incomplete 
data), outlier (noisy data), inconsistent data. As per the work 
performed in data exploration, attributes with data quality is- 
sues are identified as indicated. Since data quality will have 
direct impact on model performance, each type of issue will 
be handled respectively before loading the data for further 
analysis. (See in Table 2) 





Incomplete Data 

This dataset contained a large volume of missing data. Com- 
monly, there are many reasons which would lead to data in- 
completeness: 1). Malfunction of the machine or human op- 
erational mistake 2). Due to privacy concern, the interviewee 
did not provide the information. For this dataset, the latter 
one is considered as the main cause. 





There are many feasible approaches to handle missing val- 
ues. In regards to this dataset, the median value will be filled 
as a substitute. 


Below is the code for processing data incompleteness: 


1.1 Inconsistent Data 


Table 3: Attributes with inconsistent issue 


No Variables Attributes Inconsistent 





Types 


1 Smokes (years) Ratio Y 

2 Smokes (packs/year) Ratio Y 

3 Hormonal Ratio b4 
Contraceptives(years) 

4 IUD (years) Ratio Y 


Inconsistent data is referring data which does not fit into the 
current format. This is mainly due to data integration from 
the heterogenous data source. Different format data does not 
consistent with each other and cause to data redundancy and 
data inaccuracy. In this dataset, some attributes for counting 
of years 1s supposed to be an integer, however, in the given 
dataset is shown with unreasonable precision. To handle this, 
will round the values in these attributes and get an integer. 
Processing code as shown in Fig.25. 





/*inconsistent data*/ 





data cervical_cancer2; 

set cervical _cancer3; 

Smokes __years_ = round(Smokes_ years ,1); 

Smokes__packs_year_ = round(Smokes__packs_year_,1); 

Hormonal Contraceptives __years_ = round(Harmonal_Contraceptives_years_,1); 
IUD__years_ = round(IUD_years_,1); 


Figure 25: Code for adjusting precision for inconsistent attrib- 
utes. 


Outlier 


Table 4: Attributes with suspect outliers 





No Variables Attributes Outlier 
Types 
1 STDs: Time since the first Ratio Y 
diagnosis 
2 STDs: Time since the last Ratio Y 
diagnosis 


The outlier is referring to the numeric value in the dataset 
which far deviates from other values. In this dataset, below at- 
tributes STDs: Time since first diagnosis and STDs: Time since 
the last diagnosis is suspected with outliers, (Table 4) because 
the years provided in some of the observations are equal to 
or greater than the Age value. The responder purposely gave 
the wrong answer or there is some malfunction in the data 
collection process. Through inspection, only two records are 
detected with this kind of situation, hence will remove it. The 
total dataset size is now 856 instead of 858. (Fig.26) 


/*outlier*/ 

data cervical _cancer3; 

set cervical _cancer2; 

if STDs_Time_since _first_diagnosis >= age then delete; 
if STDs_Time_since last_diagnosis >= age then delete; 


Figure 26: Code for handling outline. 


Dataset Transformation 

For better modelling performance, enhanced visualization 
effect and improved computational power, data transfor- 
mation is required. There are mainly two objectives in data 
transformation, one is to generate an intermediate variable 
for analysing, the other is for fixing missing value and skew- 
ness problem. There are many transformation techniques, in 
this dataset, two main approaches will be applied: 1). Gen- 
eralization 2). Feature scaling. Table 5 is the summarization 
table for variables which will be transformed, and also which 
type of transformation will be applied. 


Table 5: Attributes to be transformed and the trans- 
formation method 


No Variables Attributes Transforma- 
Types tion Method 
1 Age Ratio Generalization 
P Number of sexual Ratio Generalization 
partners 
3 First sexual intercourse Ratio Generalization 
Generalization 


This is also called discretization, to discretize values into or- 
dinal groups. Basing on the extreme value and number of 
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instances, bin number and bin width can be decided, then 
each of the value is pumped into specific groups. 


(1) Age 
The FREQ Procedure 

Cumulative Cumulative 

Age Frequency Percent Frequency Percent 
13 1 0.12 1 0.12 
14 5 0.58 6 0.70 
15 21 2.45 27 3.15 
16 23 2.68 50 5.83 
17 35 4.08 85 9.91 
18 50 5.83 135 15.73 
19 44 5.13 179 20.86 
20 45 5.24 224 26.11 
21 46 5.36 270 31.47 
22 30 3.50 300 34.97 
23 54 6.29 354 41.26 
24 39 4.55 393 45.80 
25 39 4.55 432 50.35 
26 38 4.43 470 54.78 
27 33 3.85 503 58.62 


Figure 27: Frequency Table for Age. 


Fig.27 is the frequency table for Age. There are 858 data en- 
tries, the value range is 84 — 13 = 71. Considering the data 
range and dataset size, choose to discretize with 10 bins, each 
bin with roughly 86 instances. Fig.28 shows the code for dis- 
cretization as well as output variable after transformation: 


data CERVICAL_CANCER_transform; 

set CERVICAL_CANCER; 

if age le 17 then age_transformed =1; 
else if age le 19 then age_transformed 
else if age le 21 then age_transformed 
else if age le 23 then age_transformed 
else if age le 25 then age_transformed 
else if age le 28 then age_transformed 
else if age le 30 then age_transformed 
else if age le 34 then age_transformed 
else if age le 37 then age_transformed 
else age_transformed = 10; 


N 
Woon aU A U 
w 


w. 


“we . we we Wwe 


we 


Age age_transformed a 


Figure 28: Code for Transformation and Age after transformation. 


(2) Perform the same to the rest variables. Detailed codes 
can be found in the appendix. Fig.29 shows the output after 
transformation. 


age_transformed 


2 3 | 


Number_of_sexual_p_transformed First_sexual_inter_transform 


) N 





Figure 29: Output variables after transformation. 


1.2 Feature scaling 

Feature scaling techniques are for scale down the range of 
the variable, so latter one when applying machine learning 
algorithm on it, computational power can be enhanced. Be- 
sides, some algorithms such as neural network, SVM can 
only work with a number between zero and one, scale down 
the feature and get it normalized is a necessary step. (Table 


6) 





Table 6: Attributes to be transformed with feature 
scaling 


No Variables Transformation 


Method 


Attributes 


Types 





1 Number of preg- Ratio Feature Scaling 
nancies 

2 IUD (years) Ratio Feature Scaling 

3 STDs (number) Ratio Feature Scaling 

4 STDs: Numberof Ratio Feature Scaling 
diagnosis 


The techniques applied here will be normalization. The final 
normalized value would be the current value be subtracted 
with a minimum value of the entire column and divided by 
the subtraction of maximum and minimum value: 


Equation (1): 


normalized value = x — xmin/xmax - xmin 


(1) Number of pregnancies 


For the number of pregnancies, firstly use PROC FREQ get 
the maximum and minimum value, which is 0 and 8, so each 
value will be subtracted with 0 and divided by 8. Processing 
code as follows (Fig.30): 


data CERVICAL_CANCER_transform; 
set CERVICAL_CANCER_transform; 
Num_of_pregnancies_transform = (Num_of_pregnancies - @)/8; 


Figure 30: Normalization for Number of pregnancies. 
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(2) Rest of the variables are handled in the same method. 
Output after transformation as shown below (Fig.31): 


STDs__number_transform Num_of_pregnancies_transform 1UD__years_transform 


11 


=a = 


Figure 31: Normalization output. 


DISCUSSION 


Section 4 and 5 are dealing with data cleaning and data trans- 
formation. Both of which are necessary steps of data pre- 
processing before data can be loaded later to other analysing 
tools. In this task, SAS studio is employed for initial data 
exploration (understanding spread and trend), data cleaning 
as well as data transformation. In the process of pre-process- 
ing, the original data file has been modified and new attrib- 
utes are generated as the secondary analysing variable. The 
output after data cleaning is named as “X cleanup” and this 
has been used as the input for data transformation. The final 
output from data transformation is named as “X transfor- 
mation” (see in Fig.32). This file will be downloaded and 
exported for analysing afterwards. 





4 WORK 

FA CERVICAL_CANCER 

Fa CERVICAL_CANCER_CLEANUP 

FA CERVICAL_CANCER_TRANSFORM 


7 gJ g 


Figure 32: Output dataset from data pre-processing. 


Apache Hadoop 

Hortonworks Data Platform (HDP) is an open-sourced frame- 
work providing platform for Apache Hadoop distributed ar- 
chitecture. It is mainly used for handling data-at-rest. While 
Hortonworks DataFlow (HDF) provides an end-to-end data 
collection, curates and analysis platform for handling data in 
real-time. Hortonworks Sandbox is an easy access desktop 
portal for utilizing HDP and HDF. It is based on a single node 
VM and is integrated with solutions such as HDFS, YARN 
and HIVE" For analysing the cervical data, will upload the 
dataset onto Hadoop through HDP sandbox, and access ma- 
nipulate via HQL of HIVE. Below Fig.33 shows, the dash- 
board of sandbox after Linux VM machine is started. 
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Figure 33: Hortonwork Sandbox. 


Next, change to HIVE view and create a new database named 
“dm”. Code is shown as in Fig.34 and can see from the left 
menu side, the new database is created. After this, upload 
transformed dataset from local onto HIVE by entering into 
“Upload Table” option. As see in Fig.35, after specifying a 
local directory, few changes need to be applied: (1) In File 
Type option, check the box identifying there is subject title 
included in the first row of the dataset. (2) Store the dataset 
as Optimised Row Columnar file format (ORC) to get more 
efficient performance. (3) Change the database from default 
to “dm”. Now, the dataset has already been uploaded from 
local to HIVE. 


Next step, to utilize the ORC file, command in Fig.36 is ex- 
ecuted and the dataset is now stored as ORC file and named 
as “cervical cencer orc”. 


Database Explorer gS Query Editor 
default = Worksheet 
| 
Databases 
Siefault 


Sf oodmiart 


Figure 34: Creating a new Database. 
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Figure 35: Uploading local dataset. 
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Worksheet * x cervical_cancer sample * % 


| CREATE TABLE cervical cancer orc STORED AS ORC AS SELECT * FROM cervical cancer; 
2 | 


Figure 36: Code for creating ORC table. 


Data Mining Hypothesis and Analysis 

To visualize the dataset, and further study the relationships 
between each attribute will load the dataset into Tableau. 
There are several different approaches to perform data load- 
ing. For this assignment, will use Open Database Connectiv- 
ity (ODBC) as the bridge to connect Tableau application to 
the Database Management System (DBMS). ODBC permits 
applications with multiple concurrent connections with dif- 
ferent databases in a consistent way''. The following steps 
show the details for configuring ODBC to connect to HDP 
sandbox and load data from Hadoop to Visualization tool. 


Set up the ODBC connection 


rs OOBC Data Source Adrnanestrator (64-ba) x 
Ur OSN System OSN File DSN Ores Tacno Connection Posling About 
Syten Osta Sources 








Hostia) 


Aven ot Opies 


Test Results x< 
Test Results 
SUCCESS! 
Successfully connected to data source! 


COSC Version: 03.80 
Oriver Version: 2.1. 12.1017 
-bit 





Figure 37: Config for ODBC. 


As shown in Fig.37, firstly, call up ODBC for 64bits from 
windows system, then config for Hortonworks Hive DSN by 
providing host portal and username. Click to test the con- 
nection to ensure the connection is successfully established. 


Connecting HDP to Tableau 

Start-up Tableau, choose to load data from Hadoop Server 
(See in Fig.38). Provide Hortonwork host portal address and 
username, then to sign-in. 


Connect 


Hortonworks Hadoop Hive 


Enter information to sign in to the server: 
Type: HiveServer2 


Authentkaton: Username 














Figure 38: Connect Tableau to Hadoop. 


Next thing is to extract data from Hadoop server and get the 
table named “cervical cancer orc”. It can be accessed by 
setting schema as “dm” first, then extract table by the search 
for table name (as seen in Fig.39). 


Hypotheses and Analysis 


(1) Hypotheses 

Basing on the business understanding from Literature Re- 
view and data understanding from data exploration and anal- 
ysis in SAS and HIVE, below hypotheses are set up, and 
visualization techniques are applied. 


Hypothesis 1: There is a positive correlation between the 
consumption of cigarette and cancer infection rate. 


Hypothesis 2: The use of Intrauterine device (IUD) will re- 
duce the chance for cervical cancer infection rate. 


Hypothesis 3: Infection of HPV will cause a high risk of cer- 
vical cancer infection. 


Hypothesis 4: With the rise of the number of sexual partners, 
the chance of getting cervical cancer will also increase. 


Hypothesis 5: There is an ageing period most risky for fe- 
males to be infected with cervical cancer. 


(2) Analysis with visualization 


a) Cigarette consumption and cancer infection ratio 


As can tell from Fig.40, whether the patient is diagnosed with 
cancer 1s set as a dimension (0,1). Then average cigarette con- 
sumption years is visualized in the y-axis. As can tell for Dx: 
Cancer and Dx: HPV, patient-reported with disease infection 
showing higher cigarette consumption period. This trend does 
not apply for Dx and Dx CIN records with value for 1. Suspect 
this is due to limitation of record volume, hence plot Fig.41 
showing distribution chart. As can tell, total records for Dx 
positive response is 24 and Dx CIN positive the response is 9, 
therefore analysis result 1s highly biased and cannot be trusted. 
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Figure 39: Extract Table from Hadoop. 100 
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In light of the above, further study into the relationship be- DxHPV:0 DxXHPV:1 | DxHPV:1 | DxHPV:1 | DxHPV:0 DxHPV:1 | DxHPV:0 


tween cigarette and cancer infection rate basing on the data 
of Dx Cancer and Dx HPV. Will generate an intermediate 
variable: packs total consumption by multiple smoke years 
and smoke_packs_year. And then get the average cigarette 

consumption for infected patients and uninfected patients a eo 
respectively (As see in Fig.42). Total cigarette consumption 90 
amount for Dx HPV (1) records is higher than that of Dx 


Figure 41: Dataset distribution by Disease Category. 


1,369 


HPV (0). Same goes for Dx Cancer. i 
From smoker and non-smoker perspective to analysis = ” 
whether smoking habit will simulate the occurrence of can- B60 
cer, below Fig.43 is plotted. As can tell from the graph, the 2 
infection rate has raised from 2.04% to 2.46% between non- 3” 
smoker and smoker. 2 40 
a 
In conclusion, whether the patient is a smoker, how many È a0 
years she smokes, and the amount of cigarette consumption 
20 


all intensively related to cervical cancer infection rate. Hy- 
pothesis | is proved correct. i 
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Figure 40: Visualization between cigarette consumption years 0% 
and cancer infection. 


Figure 43: Infection rate among smoker vs non-smoker. 
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b) IUD usage and cancer infection rate 


Fig.44 shows an overview of IUD usage. There are 856 ob- 
servations, amongst which 117 responders did not provide 
the answer for IUD usage, 656 gave 0 and 83 of them gave 
1. Next, the infection rate of HPV is showcased in Fig.45. 
The infection rate of IUD user is higher than that of not IUD 
user, this might due to the record size as well. Hypothesis 2 
is rejected. 
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Figure 44: IUD record distribution. 
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Figure 45: Infection rate among different groups. 
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c) The infection rate of HPV and infection rate of cancer 


Fig.46 shows distribution of HPV infects and non-infected 
records. Totally 18 HPV (1) records comparing with 838 
HPV (0) records. Further, delve into the cancer infectious 
rate within each group. As see in Fig.47 HPV positive result 
is highly possible to be linked with Cancer positive result, 
the infection rate increased from 0.24% to 88.89%. 
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Figure 46: HPV infects vs non-infect records. 
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Figure 47: Cancer infection rate among HPV patient and non- 
HPV patient. 


d) Number of sexual partners and cancer infection rate 


Fig.48 shows the distribution of the number of sexual partners 
amongst the responders. As can tell the majority answered | 
to 3, and there some other records sparsely distributed along 
X-axis. After this, will use transformed variable number_of_ 
sexual _partner_transform as dimension. A roughly equal 
number of records are put into each bin. Then the infection 
rate is visualized as in Fig.49. There is an upward trend from 
Group one to Group three, raised from 1.46% to 1.69% to 
2.82%. This indicates with the number of sexual partners in- 
creasing, the risk of getting infected with cervical cancer 1s 
also increasing. 
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Figure 48: Number of Sexual Partners Distribution. 
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Figure 49: Infection rate among different groups 


e) Cancer infection rate against age distribution 


Fig.50 shows the distribution of Age. The majority of the 
responders are aged between 15 and 37. Next, will employ 
transformed variable Age transform, which has roughly the 
same bin width among the 10 age bins, to analysis infection 
rate within age groups. As can summarize from Fig.51 infec- 
tion rate is relatively low among Age group | to 3 (equal or 
younger than age 21), then raised a bit among Age group 4 
to 7 (equal or younger than age 30) and reaches a peak time 
among age group 8 to 10 (beyond 30). This indicates the risk 
of cancer infectious rate is positively related to age increase. 
In consideration of this, a woman should be more cautious 
about cervical cancer with age growing. 
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Figure 50: Age Distribution. 


100 
90% 
and 
60% 
ary 
i , 3 4 5 6 8 ) 10 


Figure 51: Infection rate among different age groups. 
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CONCLUSION 


Through this work, roughly more than 80% of the time is 
spent on data exploration and pre-processing. Data cleaning 
and transformation steps are necessary since during the work 
of data exploration, it has shown there are some data quality 
issues such as missing data, and noisy data. There are several 
issues detected with this dataset: 


Firstly, the size of the dataset is not large enough, which 
might result in biased prediction result. For example, when 
studying about the relation of cigarette consumption and 
cancer infection rate, the data pattern is showing adverse in 
comparison with the initial hypothesis and related literature 
understanding, which is caused by the sparsity of records. 


Secondly, the incompleteness of this dataset. For many at- 
tributes such as STD status and IUD usage situation, there 1s 
a large amount of missing value. This is most probably due 
to privacy and security concern from the patient. 


Thirdly, the veracity of the dataset. There are some attributes 
such as STDs: Time since first diagnosis and STDs: Time 
since the last diagnosis are identified with some unreason- 
able values. This might be caused by a human mistake in the 
data collection process or the responder intentionally gave 
the wrong answer in consideration of privacy security. 


Due to abovementioned reasons, in data analysing phase, 
the result sometimes conflicts with reasoning deduction. In 
consideration of this, further study in this field shall fix the 
above issues, then prediction accuracy can be improved. 


Since some attributes of this dataset are relating with quite 
sensitive information, in the fear of information leakage or 
out of embarrassment some of the responders gave a false 
answer or avoided answering. This highly jeopardized the 
genuineness of the data and affected the accuracy of predic- 
tion. To safeguard the privacy of patients and conform to le- 
gal regulatory, healthcare centres are expected to apply data- 
preserving techniques before publishing or sharing the data. 


As understand from literature, several factors will cause 
a high risk of cervical cancer. Basing on the theory, five 
hypotheses were set up, including factors of, cigarette con- 
sumption, IUD usage, HPV infection status, number of 
sexual partners and age group. For four of the hypotheses, 
visualization from data matches the assumption, except for 
IUD usage, which is suspected to be caused by limited re- 
cord volume. Proved insights can be drawn from this study 
are: cigarette consumption amount is positively related with 
cancer infection rate; For HPV infected patients, there is a 
higher chance for them to get infected with cervical cancer; 
The number of sexual partners will impact the risk of cer- 
vical cancer infection risk; Woman with elder age group 1s 
under higher infection risk of cervical cancer. A future study 
could collect more data relating to IUD factor and further 
study the relationship between cancer and IUD. 
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